
Implement predict_proba as per #138 #211

Open · wants to merge 16 commits into development
Conversation

@Mec-iS (Collaborator) commented Oct 31, 2022

See #138

The last test, fit_predict_probabilities, is failing; I don't know whether that is because of the implementation or because I am missing something in replicating the test in sklearn:

from sklearn.ensemble import RandomForestClassifier

# x, y: the same dataset used in the fit_predict_probabilities test
clf = RandomForestClassifier(criterion="gini")
clf.fit(x, y)
print(clf.predict_proba(x))

[[0.99 0.01]
 [0.82 0.18]
 [0.97 0.03]
 [0.8  0.2 ]
 [0.99 0.01]
 [0.9  0.1 ]
 [0.99 0.01]
 [0.91 0.09]
 [0.23 0.77]
 [0.4  0.6 ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.01 0.99]
 [0.02 0.98]
 [0.   1.  ]
 [0.01 0.99]]


@Mec-iS Mec-iS changed the title Implent predict_proba as per #138 Implement predict_proba as per #138 Oct 31, 2022
@dlrobson commented:

Hey guys! My team and I have just started using this library for our project. This functionality would be awesome for what we're doing. Is there any update on the progress of this?

@morenol (Collaborator) commented Jan 27, 2023

> Hey guys! My team and I have just started using this library for our project. This functionality would be awesome for what we're doing. Is there any update on the progress of this?

Hey, I don't think there have been any more updates; none of the contributors have had time to finish it.

@Mec-iS (Collaborator, Author) commented Jan 27, 2023

> Hey guys! My team and I have just started using this library for our project. This functionality would be awesome for what we're doing. Is there any update on the progress of this?

It would be nice to have somebody with an understanding of this feature; please help if you know how to implement it. It seems there are some differences between the results returned by smartcore and those returned by sklearn.

@lars-frogner commented:

Hi! Thanks for providing this nice library, we are finding the random forest implementations really useful.

I am very interested in getting predicted class probabilities from the random forest classifier, so I have been looking into this issue.

As far as I can tell, the way sklearn does it is to record the per-class sample counts in each node when fitting a decision tree. The class probabilities returned by DecisionTreeClassifier.predict_proba are then the per-class sample counts of the predicted leaf node divided by that node's total sample count. The predict_proba method of RandomForestClassifier then calls predict_proba for each tree and averages the resulting probabilities.
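The count-averaging scheme described above can be sketched in plain Python. The leaf counts below are made up for illustration; they are not from a real fit:

```python
# Hypothetical per-class sample counts at the leaf a single sample
# reaches in each of three fitted trees (made-up numbers).
leaf_counts = [
    [9, 1],   # tree 1: 9 samples of class 0, 1 of class 1 in the leaf
    [4, 1],   # tree 2
    [10, 0],  # tree 3
]

def leaf_proba(counts):
    # DecisionTreeClassifier.predict_proba: normalize the leaf's counts.
    total = sum(counts)
    return [c / total for c in counts]

# RandomForestClassifier.predict_proba: average the per-tree probabilities.
per_tree = [leaf_proba(c) for c in leaf_counts]
forest_proba = [
    sum(p[k] for p in per_tree) / len(per_tree) for k in range(2)
]
# forest_proba is approximately [0.9, 0.1]
```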

I have implemented this in a separate smartcore fork, and it gives results that are quite close to the RandomForestClassifier::predict_proba implementation in this branch. The latter gives a bit more coarse-grained results since it uses the predicted classes rather than the underlying probabilities.
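For contrast, a vote-based estimate, which is my reading of what this branch's implementation amounts to (the votes below are hypothetical), looks like this:

```python
# Hypothetical hard class predictions from five trees for one sample.
tree_votes = [0, 0, 1, 0, 0]

# Probability of each class = fraction of trees voting for it.
n_trees = len(tree_votes)
vote_proba = [tree_votes.count(k) / n_trees for k in (0, 1)]
# vote_proba == [0.8, 0.2]
```

Because each tree contributes its whole 1/n_trees of probability mass to a single class, the estimate is coarser: with five trees the probabilities can only be multiples of 0.2.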

Since the results are pretty similar to before, they still deviate by up to 10% from sklearn's class probabilities for the input used in the fit_predict_probabilities test. But there are other inputs that give the exact same probabilities as sklearn. So I suspect the differences are not due to the predict_proba implementation, but rather a result of different splitting policies when building a decision tree.

Here is a small test case showing a difference in splitting between smartcore and sklearn:

This sklearn test passes:

from sklearn.tree import DecisionTreeClassifier

X = [
    [1., 1., 0.],
    [1., 1., 0.],
    [1., 1., 1.],
    [1., 0., 0.],
    [1., 0., 1.],
]

y = [1, 1, 0, 0, 1]

dt = DecisionTreeClassifier()
dt.fit(X, y)

assert dt.tree_.node_count == 7

This corresponding smartcore test fails, since the fit results in a tree with only a single node.

let x = DenseMatrix::from_2d_array(&[
    &[1., 1., 0.],
    &[1., 1., 0.],
    &[1., 1., 1.],
    &[1., 0., 0.],
    &[1., 0., 1.],
])
.unwrap();

let y = vec![1, 1, 0, 0, 1];

// We use the same defaults as sklearn
let classifier =
    DecisionTreeClassifier::fit(&x, &y, DecisionTreeClassifierParameters::default())
        .unwrap();

assert_eq!(classifier.nodes().len(), 7);

This might be old news, but I think it shows that we can't expect to get the same probabilities as sklearn, at least not without first replicating their exact splitting policy (which seems more complicated).

Since we have now tried two different predict_proba implementations that give similar results and also agree with sklearn for inputs where the splitting is more straightforward, it seems safe to me to proceed with merging one of them. The implementation in my fork has the advantage of providing predict_proba for DecisionTreeClassifier in addition to RandomForestClassifier and that the probabilities are probably a bit more precise. The disadvantage is having to store num_features * num_classes counts in every decision tree. An option could be to add a keep_counts parameter to DecisionTreeClassifierParameters and let predict_proba fail if the counts were not kept, similarly to keep_samples for RandomForestClassifier.
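A sketch of the proposed opt-in, written in Python for brevity: the keep_counts name and the fail-if-not-kept behavior are the suggestion above, not an existing smartcore API, and the "fitting" here is a stand-in that just records counts for a single leaf.

```python
class TreeWithOptionalCounts:
    """Toy model of a fitted tree that keeps per-leaf class counts
    only when asked to, mirroring the proposed keep_counts flag."""

    def __init__(self, keep_counts=False):
        self.keep_counts = keep_counts
        self.leaf_class_counts = None

    def fit(self, leaf_class_counts):
        # Stand-in for real fitting: record the per-leaf class counts
        # only if the flag is set, so memory is spent only on demand.
        if self.keep_counts:
            self.leaf_class_counts = list(leaf_class_counts)
        return self

    def predict_proba(self):
        # Fail loudly when the counts were not kept, analogous to how
        # keep_samples gates sample-dependent methods on the forest.
        if self.leaf_class_counts is None:
            raise ValueError(
                "predict_proba requires fitting with keep_counts=True"
            )
        total = sum(self.leaf_class_counts)
        return [c / total for c in self.leaf_class_counts]


proba = TreeWithOptionalCounts(keep_counts=True).fit([9, 1]).predict_proba()
# proba is [0.9, 0.1]
```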
